141 research outputs found

    Audio-assisted movie dialogue detection

    Get PDF
    An audio-assisted system is investigated that detects whether a movie scene is a dialogue or not. The system is based on actor indicator functions, i.e., functions which define whether an actor speaks at a certain time instant. In particular, the cross-correlation and the magnitude of the corresponding cross-power spectral density of a pair of indicator functions are input to various classifiers, such as voted perceptrons, radial basis function networks, random trees, and support vector machines, for dialogue/non-dialogue detection. To boost classifier efficiency, AdaBoost is also exploited. The aforementioned classifiers are trained using ground-truth indicator functions determined by human annotators for 41 dialogue and another 20 non-dialogue audio instances. For testing, actual indicator functions are derived by applying audio activity detection and actor clustering to audio recordings. 23 instances are randomly chosen among the aforementioned instances, 17 of which correspond to dialogue scenes and 6 to non-dialogue ones. Accuracy ranging between 0.739 and 0.826 is reported. © 2008 IEEE
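    A minimal sketch of the feature idea described above, assuming synthetic binary actor indicator functions and using SciPy/scikit-learn; the feature dimensions, classifier settings, and toy data are illustrative assumptions, not the authors' exact configuration:

```python
import numpy as np
from scipy.signal import correlate, csd
from sklearn.svm import SVC

def pair_features(ind_a, ind_b, fs=1.0, n_lags=32, n_freqs=32):
    """Feature vector for one pair of actor indicator functions:
    central cross-correlation lags plus the magnitude of the
    cross-power spectral density (illustrative sizes)."""
    xcorr = correlate(ind_a - ind_a.mean(), ind_b - ind_b.mean(), mode="full")
    mid = len(xcorr) // 2
    xcorr = xcorr[mid - n_lags: mid + n_lags + 1]
    _, pxy = csd(ind_a, ind_b, fs=fs, nperseg=min(256, len(ind_a)))
    cpsd_mag = np.abs(pxy)[:n_freqs]
    return np.concatenate([xcorr, cpsd_mag])

# Toy example: alternating speakers (dialogue-like) vs. a single speaker.
t = np.arange(1000)
dialogue_a = (t // 50) % 2 == 0           # actor A speaks on even segments
dialogue_b = ~dialogue_a                  # actor B fills the gaps
mono_a = np.ones_like(t, dtype=bool)      # one actor speaks throughout
mono_b = np.zeros_like(t, dtype=bool)

X = np.stack([pair_features(dialogue_a.astype(float), dialogue_b.astype(float)),
              pair_features(mono_a.astype(float), mono_b.astype(float))])
y = np.array([1, 0])                      # 1 = dialogue, 0 = non-dialogue
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X))
```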

    Towards Emotion Recognition: A Persistent Entropy Application

    Full text link
    Emotion recognition and classification is a very active area of research. In this paper, we present a first approach to emotion classification using persistent entropy and support vector machines. A topology-based model is applied to obtain a single real number from each raw signal. These data are used as input to a support vector machine to classify signals into 8 different emotions (calm, happy, sad, angry, fearful, disgust and surprised).
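    A short sketch of persistent entropy computed from a persistence diagram (a list of (birth, death) intervals), followed by an SVM on the resulting scalar. The diagrams and labels below are purely illustrative, and the paper's signal-to-diagram step is not reproduced here:

```python
import numpy as np
from sklearn.svm import SVC

def persistent_entropy(diagram):
    """Persistent entropy of a persistence diagram given as (birth, death)
    pairs: H = -sum(p_i * log(p_i)), with p_i the normalized bar lifetimes."""
    lifetimes = np.array([d - b for b, d in diagram if np.isfinite(d)])
    lifetimes = lifetimes[lifetimes > 0]
    p = lifetimes / lifetimes.sum()
    return float(-(p * np.log(p)).sum())

# Illustrative diagrams standing in for signals of two emotion classes.
diagrams = [
    [(0.0, 0.9), (0.1, 0.8), (0.2, 0.3)],   # long, spread-out bars
    [(0.0, 0.2), (0.0, 0.2), (0.1, 0.3)],   # short, uniform bars
]
X = np.array([[persistent_entropy(d)] for d in diagrams])  # one real number per signal
y = np.array([0, 1])                                       # toy emotion labels
clf = SVC(kernel="linear").fit(X, y)
print(X.ravel(), clf.predict(X))
```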

    Speech Emotion Recognition Considering Local Dynamic Features

    Full text link
    Recently, increasing attention has been directed to the study of speech emotion recognition, in which global acoustic features of an utterance are mostly used to eliminate content differences. However, the expression of speech emotion is a dynamic process, which is reflected through dynamic durations, energies, and other prosodic information when one speaks. In this paper, a novel local dynamic pitch probability distribution feature, obtained by drawing a histogram, is proposed to improve the accuracy of speech emotion recognition. Compared with most previous works using global features, the proposed method takes advantage of the local dynamic information conveyed by the emotional speech. Several experiments on the Berlin Database of Emotional Speech are conducted to verify the effectiveness of the proposed method. The experimental results demonstrate that the local dynamic information obtained with the proposed method is more effective for speech emotion recognition than traditional global features. Comment: 10 pages, 3 figures, accepted by ISSP 201
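    A brief sketch of the histogram idea, assuming a pitch contour has already been extracted (the frame values below are made up); the paper's exact binning and normalization are not specified here, so this only approximates a pitch probability distribution feature:

```python
import numpy as np

def pitch_histogram_feature(f0, n_bins=20, fmin=50.0, fmax=500.0):
    """Normalized histogram of voiced pitch values: a per-utterance
    probability distribution over pitch, usable as a feature vector."""
    voiced = f0[(f0 >= fmin) & (f0 <= fmax)]      # drop unvoiced/outlier frames
    counts, _ = np.histogram(voiced, bins=n_bins, range=(fmin, fmax))
    if counts.sum() == 0:
        return np.zeros(n_bins)
    return counts / counts.sum()                  # probabilities summing to 1

# Made-up pitch contour (Hz per frame) for illustration.
f0 = np.array([0, 0, 180, 185, 190, 210, 220, 0, 240, 250, 230, 0, 0])
print(pitch_histogram_feature(f0))
```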

    Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema

    No full text
    In this paper, a psychologically-inspired binary cascade classification schema is proposed for speech emotion recognition. Performance is enhanced because commonly confused pairs of emotions are distinguishable from one another. Extracted features are related to statistics of pitch, formants, and energy contours, as well as spectrum, cepstrum, perceptual and temporal features, autocorrelation, MPEG-7 descriptors, Fujisaki's model parameters, voice quality, jitter, and shimmer. Selected features are fed as input to a K-nearest neighbor classifier and to support vector machines. Two kernels are tested for the latter: linear and Gaussian radial basis function. The recently proposed speaker-independent experimental protocol is tested on the Berlin emotional speech database for each gender separately. The best emotion recognition accuracy, achieved by support vector machines with a linear kernel, equals 87.7%, outperforming state-of-the-art approaches. Statistical analysis is first carried out with respect to the classifiers' error rates and then to evaluate the information expressed by the classifiers' confusion matrices. © Springer Science+Business Media, LLC 2011
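    A minimal sketch of a binary cascade: each stage is a binary classifier that splits the remaining set of emotions in two, and a sample is routed down the tree until a single emotion is left. The grouping below, the random features, and the scikit-learn SVMs are assumptions for illustration; they do not reproduce the paper's psychologically-motivated splits or its feature set:

```python
import numpy as np
from sklearn.svm import SVC

class CascadeNode:
    """One binary stage: decide between two groups of emotions, then recurse."""
    def __init__(self, left_labels, right_labels):
        self.left, self.right = set(left_labels), set(right_labels)
        self.clf = SVC(kernel="linear")
        self.left_child = self.right_child = None

    def fit(self, X, y):
        side = np.array([0 if lbl in self.left else 1 for lbl in y])
        self.clf.fit(X, side)
        return self

    def predict_one(self, x):
        group = self.left if self.clf.predict(x[None])[0] == 0 else self.right
        child = self.left_child if group is self.left else self.right_child
        if child is None:                 # leaf group: a single emotion remains
            return next(iter(group))
        return child.predict_one(x)

# Toy data: 4 emotions with random features; the split hierarchy is assumed.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = rng.choice(["anger", "happiness", "sadness", "boredom"], size=80)

root = CascadeNode(["anger", "happiness"], ["sadness", "boredom"]).fit(X, y)
mask_ah = np.isin(y, ["anger", "happiness"])
root.left_child = CascadeNode(["anger"], ["happiness"]).fit(X[mask_ah], y[mask_ah])
root.right_child = CascadeNode(["sadness"], ["boredom"]).fit(X[~mask_ah], y[~mask_ah])

print(root.predict_one(X[0]))
```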

    DigiArt: towards a virtualization of Cultural Heritage

    Get PDF
    DigiArt is a Europe-wide project aimed at providing a new, cost-efficient solution to the capture, processing, and display of cultural artefacts. The project will change the ways in which the public interacts with cultural objects and spaces in a dramatic way. This project is unique in its collaborative approach: cultural heritage professionals working directly with electrical, mechanical, optical, and software engineers to develop a solution to current issues faced by the museum sector. The innovations created by the engineers are driven by the demands of the cultural heritage sector. The diversity of the objects and spaces of the three test museums is challenging the engineers to provide a tool useful for a broad variety of indoor and outdoor museums in the future. This ranges from using Unmanned Aerial Vehicles (UAVs, or drones) to fly over and record large sites, to using scanners to record fine jewellery. As a case study, we present here the use case of Scladina Cave. At the end of the project, the Scladina Cave Archaeological Centre will offer two different visitor experiences. The first uses virtual reality, which will be available anytime, anywhere, to anyone with an internet-connected device. The second will use augmented reality technologies within the cave site. The augmented reality visit will enhance the tour of Scladina by offering experiences that would not be possible were it not for augmented reality, where 3D objects and animations will contribute to a new 3D-immersive experience.

    Tracking the Expression of Annoyance in Call Centers

    Get PDF
    Machine learning researchers have dealt with the identification of emotional cues from speech, since it is a research domain showing a large number of potential applications. Many acoustic parameters have been analyzed when searching for cues to identify emotional categories, and both classical classifiers and outstanding computational approaches have been developed. Experiments have been carried out mainly over induced emotions, even if research has recently been shifting towards spontaneous emotions. In such a framework, it is worth mentioning that the expression of spontaneous emotions depends on cultural factors, on the particular individual, and also on the specific situation. In this work, we were interested in the emotional shifts during conversation. In particular, we aimed to track the annoyance shifts appearing in phone conversations to complaint services. To this end, we analyzed a set of audio files showing different ways to express annoyance. The call center operators found disappointment, impotence, or anger as expressions of annoyance. However, our experiments showed that variations of parameters derived from intensity, combined with some spectral information and suprasegmental features, are very robust for each speaker and annoyance rate. The work also discusses the annotation problem arising when dealing with human labelling of subjective events. In this work we proposed an extended rating scale in order to include annotators' disagreements. Our frame classification results validated the chosen annotation procedure. Experimental results also showed that shifts in customer annoyance rates could potentially be tracked during phone calls. Supported by Spanish Mineco under grant TIN2014-54288-C4-4-R and by the H2020 EU Empathic RIA, action number 769872.
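    A small sketch of frame-level intensity and spectral features of the kind discussed above, assuming librosa is used for extraction; the file path is hypothetical, and the paper's exact parameter set and suprasegmental features are not reproduced:

```python
import numpy as np
import librosa

def frame_features(path, frame_length=2048, hop_length=512):
    """Per-frame RMS intensity and spectral centroid, plus their deltas,
    as a rough stand-in for intensity/spectral cues of annoyance."""
    y, sr = librosa.load(path, sr=None)
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop_length)[0]
    n = min(len(rms), len(centroid))
    rms, centroid = rms[:n], centroid[:n]
    feats = np.stack([rms, centroid,
                      np.gradient(rms), np.gradient(centroid)], axis=1)
    return feats   # shape: (n_frames, 4); one row per frame for classification

# Usage (the file name is a placeholder):
# X = frame_features("call_0001.wav")
```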